For this project, I will be exploring publicly available data (2007-2011) from LendingClub.com. I will explore the dataset and then build models to classify whether a borrower's profile has a high probability of paying back a loan.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
pd.set_option('display.max_columns', 500) # This setting would enable pandas to display more columns than default
pd.options.mode.chained_assignment = None # default='warn'
plt.style.use('ggplot')
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.naive_bayes import GaussianNB,BernoulliNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC, NuSVC, LinearSVC
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, learning_curve
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import classification_report,confusion_matrix
data = pd.read_csv('LoanStats3a.csv')
data.head()
data.describe()
Many columns contain no values at all (count = 0). These columns need to be dropped before deciding which features to include. There are also columns in which a significant number of entries are NaN; we need to get rid of those as well.
loans = data.dropna(axis = 1, thresh = 0.4*data.shape[0])
loans.info()
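For reference, `dropna(axis=1, thresh=k)` keeps a column only if it has at least `k` non-NA values, so the call above keeps columns that are at least 40% populated. A minimal sketch on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'full': [1, 2, 3, 4, 5],             # 5 non-NA values
    'half': [1, np.nan, 3, np.nan, 5],   # 3 non-NA values
    'empty': [np.nan] * 5,               # 0 non-NA values -> dropped
})

# Keep columns with at least 40% non-NA entries (here, at least 2 of 5)
kept = df.dropna(axis=1, thresh=int(0.4 * df.shape[0]))
print(list(kept.columns))  # ['full', 'half']
```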
Before heading into the data analysis, I want to address a potential 'data leakage' problem I noticed during my first pass over the dataset.
Without doing much data preparation or feature selection, and using sklearn classifiers with only default parameters, I could already achieve nearly perfect accuracy in classifying good versus bad loans on this dataset. This makes me think that one or several features clearly indicate a borrower's loan status.
After reading through all of the column descriptions in the dataset, I noticed a column named 'total_rec_prncp' (total received principal). Principal means the amount of money borrowed (excluding interest) in a loan. A low total received principal indicates that the lender will likely not be paid back, so one can naively assume this represents a bad loan. As a matter of fact, this is probably how LendingClub determines whether a loan is charged off in most cases.
To test this idea, I ran a classifier on a dataset with only three columns selected from the whole dataset: 'loan_amnt', 'loan_status', and 'total_rec_prncp'. 'loan_amnt' is basically the principal; 'loan_status' tells us whether a loan is good or bad. After performing the train-test split, I fit the training data with a logistic regression classifier with cross-validation from sklearn. The precision and recall on the test data are shockingly good. The plot below shows the data and decision boundaries, demonstrating a fairly clean separation between good and bad loans.
test_loans = loans[['loan_amnt', 'loan_status', 'total_rec_prncp']]
test_loans['loan_status'] = test_loans['loan_status'].replace(
    ('Fully Paid',
     'Charged Off',
     'Does not meet the credit policy. Status:Fully Paid',
     'Does not meet the credit policy. Status:Charged Off'),
    (0, 1, 0, 1))
# Train, test dataset split. This will be explained later.
stratified = StratifiedShuffleSplit(n_splits=1, test_size=0.3)
for train_set, test_set in stratified.split(test_loans, test_loans['loan_status']):
    Train = test_loans.iloc[train_set]
    Test = test_loans.iloc[test_set]
test_X_train = Train.drop('loan_status', axis=1)
test_y_train = Train['loan_status']
test_X_test = Test.drop('loan_status', axis=1)
test_y_test = Test['loan_status']
CL = LogisticRegressionCV()
CL.fit(test_X_train, test_y_train)
predictions = CL.predict(test_X_test)
print(classification_report(test_y_test, predictions))
from matplotlib.colors import ListedColormap
def plot_decision_boundaries(X, y, X_test, y_test, trained_classifier, mesh_step):
    # Color map
    colors = ('skyblue', 'coral', 'mediumpurple', 'lightgreen', 'gray', 'yellow')
    cmap = ListedColormap(colors[:len(np.unique(y))])
    # Plotting decision regions
    x1_min, x1_max = X.iloc[:, 0].min() - 100, X.iloc[:, 0].max() + 1000
    x2_min, x2_max = X.iloc[:, 1].min() - 100, X.iloc[:, 1].max() + 1000
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, mesh_step),
                           np.arange(x2_min, x2_max, mesh_step))
    Z = trained_classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    # Plot the train dataset
    ax = fig.add_subplot(211)
    for i, Class in enumerate(np.sort(y.unique())):
        ax.scatter(X[y == Class].iloc[:, 0], X[y == Class].iloc[:, 1], c=cmap(i), s=10, label=Class)
    #ax.plot(np.arange(1000, 35000), 0.6*np.arange(1000, 35000), c='grey')
    ax.contourf(xx1, xx2, Z, alpha=0.2, cmap=cmap)
    ax.legend(loc='best', fontsize=14)
    ax.set_xlabel('Loan Amount')
    ax.set_ylabel('Total Received Principal')
    ax.set_title('Decision Regions for Train Data')
    # Plot the test dataset
    ax = fig.add_subplot(212)
    for i, Class in enumerate(np.sort(y.unique())):
        ax.scatter(X_test[y_test == Class].iloc[:, 0], X_test[y_test == Class].iloc[:, 1], c=cmap(i), s=10, label=str(Class) + '_test')
    ax.contourf(xx1, xx2, Z, alpha=0.2, cmap=cmap)
    ax.legend(loc='best', fontsize=14)
    ax.set_xlabel('Loan Amount')
    ax.set_ylabel('Total Received Principal')
    ax.set_title('Decision Regions for Test Data')
fig = plt.figure(figsize = (16,16))
fig.subplots_adjust(hspace = 0.15, top = 0.94, left = 0.08, right = 0.94, bottom = 0.05)
plot_decision_boundaries(test_X_train, test_y_train, test_X_test, test_y_test, CL, 10)
Loan status '0' = 'good loan' (blue) and '1' = 'bad loan' (orange). There is a fairly linear boundary separating the bad and good loans. The decision regions were obtained from a logistic regression classifier. If you plot the line Total Received Principal = 0.6 × Loan Amount, it almost overlaps with the boundary in the plots above. This could simply mean that if the total received principal is below 60% of the amount originally borrowed, LendingClub marks the loan as bad.
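If that reading is right, the learned boundary collapses to a one-line rule. The function below is purely illustrative; the 0.6 cutoff is my estimate from the plot, not a documented LendingClub policy:

```python
def rule_of_thumb_status(loan_amnt, total_rec_prncp, ratio=0.6):
    """Label a loan bad (1) if the principal recovered so far is below
    `ratio` of the amount borrowed, else good (0).
    The 0.6 cutoff is eyeballed from the decision-boundary plot."""
    return 1 if total_rec_prncp < ratio * loan_amnt else 0

print(rule_of_thumb_status(10000, 9500))   # mostly repaid -> 0 (good)
print(rule_of_thumb_status(10000, 4000))   # only 40% recovered -> 1 (bad)
```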
Why do I think including features like 'total_rec_prncp' is bad for the purpose of this project? What I am trying to do is create a model that predicts whether a borrower will pay back his/her loan, so that LendingClub can decide whether to lend the money in the first place. At that stage, quantities like 'total_rec_prncp' cannot be known; all we know about the borrower is information such as annual income, purpose of the loan, location, and employment length. I should aim to build a model on those features, since features like 'total_rec_prncp' may already contain the outcome we are trying to predict. This is the 'data leakage' problem.
After deleting features like 'total_rec_prncp', reading through the 'LCDataDictionary' file, and reviewing the information gathered from loans.describe(), I have decided to include the following columns as my features: 'loan_amnt','term','int_rate','installment', 'grade', 'emp_length', 'home_ownership', 'annual_inc', 'loan_status', 'purpose', 'addr_state', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'total_acc', 'revol_bal', 'revol_util', 'pub_rec_bankruptcies'.
test_loans = loans[['loan_amnt','term','int_rate','installment', 'grade', 'emp_length', 'home_ownership', 'annual_inc', 'loan_status', 'purpose', 'addr_state', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'total_acc', 'revol_bal', 'revol_util', 'pub_rec_bankruptcies']]
test_loans.head()
Column 'loan_status' has four levels: 'Fully Paid', 'Charged Off', 'Does not meet the credit policy. Status:Fully Paid', and 'Does not meet the credit policy. Status:Charged Off'. 'Charged Off' means the debt is unlikely to be collected from the borrower, which is bad; this is what we want to classify and predict. Let's treat 'Fully Paid' and 'Does not meet the credit policy. Status:Fully Paid' as good loans and assign them 0, while 'Charged Off' and 'Does not meet the credit policy. Status:Charged Off' are bad loans and get 1.
test_loans['loan_status'].unique()
test_loans['loan_status'] = test_loans['loan_status'].replace(
    ('Fully Paid',
     'Charged Off',
     'Does not meet the credit policy. Status:Fully Paid',
     'Does not meet the credit policy. Status:Charged Off'),
    (0, 1, 0, 1))
Convert the interest rate into numeric form
test_loans['int_rate'] = test_loans['int_rate'].str.rstrip('%').astype('float') / 100.0
Convert revolving line utilization rate into numeric form
test_loans['revol_util'] = test_loans['revol_util'].str.rstrip('%').astype('float') / 100.0
test_loans.head()
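The two conversions above share the same pattern, so a small helper (hypothetical, not part of the notebook's pipeline) could handle any percent-formatted column:

```python
import pandas as pd

def pct_to_float(series):
    """Convert strings like '13.56%' to fractions like 0.1356.
    Missing values pass through as NaN."""
    return series.str.rstrip('%').astype('float') / 100.0

s = pd.Series(['13.56%', '7.90%', None])
print(pct_to_float(s))  # fractions, e.g. 0.1356 for '13.56%'
```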
Hover over the graph for a detailed breakdown
import cufflinks as cf
cf.go_offline()
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
pie = test_loans['loan_status'].value_counts().to_frame().reset_index()
pie.columns = ['labels','values']
pie = pie.replace((0 ,1),('good loans', 'bad loans'))
pie.iplot(kind='pie',labels='labels',values='values',pull=.1,hole=.4, showlegend = False,
textinfo='label+percent', text='Loan Ratio', textposition='inside', title = 'Percentage of bad and good loans'+'<br> (Hover for breakdown)')
The two classes, good loans and bad loans, are imbalanced. Later, when we split the data into train and test sets, we want to do it in a way that preserves the same class ratio in both.
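StratifiedShuffleSplit is what preserves that ratio. A minimal sketch on a synthetic label vector (the 85/15 mix below is illustrative, not the exact ratio in this dataset):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.zeros((1000, 1))               # dummy features
y = np.array([0] * 850 + [1] * 150)   # imbalanced labels, 15% positive

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
for train_idx, test_idx in sss.split(X, y):
    # Both splits keep the minority class at ~15%
    print(y[train_idx].mean(), y[test_idx].mean())
```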
def plot_pie_charts(feature, feature_name):
    pie = test_loans[feature].value_counts().to_frame().reset_index()
    pie.columns = ['labels', 'values']
    pie.iplot(kind='pie', labels='labels', values='values', pull=.05,
              textinfo='label+percent', textposition='inside', colorscale='Set1', showlegend=False,
              title="Distribution of " + feature_name + '<br> (Hover for breakdown)')
    pie1 = test_loans[test_loans['loan_status'] == 0][feature].value_counts().to_frame().reset_index()
    pie1.columns = ['labels', 'values']
    pie2 = test_loans[test_loans['loan_status'] == 1][feature].value_counts().to_frame().reset_index()
    pie2.columns = ['labels', 'values']
    fig = {
        "data": [
            {
                "values": pie1['values'],
                "labels": pie1['labels'],
                "domain": {"x": [0, .48]},
                "textposition": "inside",
                "textinfo": "label+percent",
                "hole": .4,
                "type": "pie"
            },
            {
                "values": pie2['values'],
                "labels": pie2['labels'],
                "domain": {"x": [.52, 1]},
                "textinfo": "label+percent",
                "hole": .4,
                "type": "pie"
            }],
        "layout": {
            "title": "Distribution of " + feature_name + ' by Loan Status' + '<br> (Hover for breakdown)',
            "showlegend": False,
            "annotations": [
                {
                    "font": {"size": 16},
                    "showarrow": False,
                    "text": "Good Loans",
                    "x": 0.18,
                    "y": 0.5
                },
                {
                    "font": {"size": 16},
                    "showarrow": False,
                    "text": "Bad Loans",
                    "x": 0.81,
                    "y": 0.5
                }
            ]
        }
    }
    iplot(fig, filename='pie_chart_subplots')
plot_pie_charts('purpose', 'Loan Purpose')
The top pie chart shows the distribution of loan purpose for all the loans. Nearly 50% of the people who borrow money from LendingClub use it for debt consolidation. Close to 13% use it to pay off their credit cards. More than 10% of the loans from 2007-2011 have no disclosed purpose. Home improvement contributes another ~7%. The remaining ~20% of borrowers used the money on things such as small business, cars, purchases, weddings, vacations, and education.
The bottom two charts show the distribution of loan purpose for good and bad loans separately. A noticeable difference is that small-business loans make up a visibly larger share of the bad loans, i.e., they are more likely to default. This is somewhat understandable, as small businesses usually carry a high risk of failure, which can result in loan default. Besides this, there is no major difference between the loan-purpose distributions of good and bad loans.
box = test_loans[['loan_amnt','purpose']].pivot(columns='purpose', values='loan_amnt')
box.iplot(kind = 'box',colorscale='paired',boxpoints ='suspectedoutliers' , yaxis_title = 'Loan Amount',
showlegend = False, title = 'Loan Amount by Loan Purpose <br> (Hover for breakdown)')
It is interesting to see how much money people borrow from LendingClub for different purposes. The median loan amount for debt consolidation is 11k, concentrated between 7k and 16.5k. Loans used to pay off credit cards and for home improvement have similar profiles, with median loan amounts around 10k. Small-business loans, which also see a big jump in share among bad loans, have the highest median loan amount at 12k, concentrated between 6.5k and 20k.
box = test_loans[['int_rate','purpose']].pivot(columns='purpose', values='int_rate')
box.iplot(kind = 'box',colorscale='paired',boxpoints ='suspectedoutliers' ,
yaxis_title = 'Interest rate', showlegend = False, title = 'Interest Rate by Loan Purpose <br> (Hover for breakdown)')
Interest rate can usually serve as an indicator of risk. It is often based on a borrower's income level, credit score, loan amount, loan term, employment situation, etc. Again, small-business loans have the highest median interest rate, 13.1%, mostly ranging from 10.37% to 16%. The median interest rates for debt consolidation and credit card are 12.61% and 11.7%, respectively. The lowest rates are found in car loans, at only 10.4%.
bar = test_loans[['purpose','loan_status']].pivot_table(columns='loan_status',index='purpose', aggfunc=np.count_nonzero)
bar.columns=['good loan','bad loan']
bar_copy = bar.copy()
bar_copy['good loan'] = 100*bar_copy['good loan']/bar.sum(axis=1)
bar_copy['bad loan'] = 100*bar_copy['bad loan']/bar.sum(axis=1)
bar_copy.iplot(kind='bar', yaxis_title = 'Percentage in Each Category', title = 'Loan Status by Loan Purpose <br> (Hover for breakdown)')
The bar plot above shows how loan purpose affects loan status. The ratio between good and bad loans for small business is close to 1.3 (57% vs 43%), while for 'debt consolidation' the ratio is ~2.8, and it is even higher for 'credit card' with a ratio of 4 (80% vs 20%). This is consistent with what we have seen: loans used on small business are much more likely to default than others. 'Education' is another category with a high percentage of bad loans. Although 'education' makes up only ~1% of all loans, with one of the lowest median loan amounts (5k), ~35% of education loans are bad.
box = test_loans[['annual_inc','purpose']].pivot(columns='purpose', values='annual_inc')
box.iplot(kind = 'box',colorscale='paired',boxpoints ='suspectedoutliers' , yaxis_title = 'Annual Income', yaxis_range = [0,200000], showlegend = False)
plot_pie_charts('term', 'Loan Terms')
Most of the loans (74%) are on the 36-month term. That share drops to 60% among bad loans, so it looks like longer-term loans are more likely to default.
box = test_loans[['int_rate','term']].pivot(columns='term', values='int_rate')
box.iplot(kind = 'box',colorscale='paired',boxpoints ='suspectedoutliers' ,
yaxis_title = 'Interest rate', showlegend = False, title = 'Interest Rate by Loan Term <br> (Hover for breakdown)')
The interest rate on longer-term loans is higher, with a median close to 15%, while the median for the shorter term is only 11%.
box = test_loans[['loan_amnt','term']].pivot(columns='term', values='loan_amnt')
box.iplot(kind = 'box',colorscale='paired',boxpoints ='suspectedoutliers' ,
yaxis_title = 'Loan Amount', showlegend = False, title = 'Loan Amount by Loan Term <br> (Hover for breakdown)')
There is a huge difference in loan amount between the two terms. The longer term has a median of ~15k; people borrow only about half that amount on the 36-month term. This is not too surprising: in general, people choose to pay a bigger loan back over a longer period so that it does not put too much pressure on their financial stability.
plot_pie_charts('home_ownership', 'Home Ownership')
The majority of borrowers either rent or are paying a mortgage on their homes; these two groups combined contribute more than 90% of the loans. Only a small fraction of borrowers own their houses outright. There is no significant difference in the home-ownership distribution between good and bad loans.
box = test_loans[['annual_inc','home_ownership']].pivot(columns='home_ownership', values='annual_inc')
box.iplot(kind = 'box',boxpoints ='suspectedoutliers', yaxis_title = 'Annual Income', yaxis_range = [0,200000], showlegend = False)
Let's first check the two most common home-ownership types among the borrowers, 'Rent' and 'Mortgage'. People paying a mortgage earn a higher median annual income of 70k, with most ranging from 51k to 99k. Borrowers who pay rent have a median annual income of 50k, mostly ranging from 35k to 70k. Although borrowers with no home ownership recorded have the highest median income in this dataset, there are only 8 such entries, far too few to represent the real situation of people without homes.
plot_pie_charts('emp_length', 'Employment Length')
More than 20% of the people who borrow money from LendingClub have worked more than 10 years. People who have been employed for less than 1 year, 2 years, 3 years, 4 years, or 5 years make up ~10% each. No major difference can be seen in the composition of good versus bad loans.
box = test_loans[['annual_inc','emp_length']].dropna(axis=0, how='any')
box = box.pivot(columns='emp_length', values='annual_inc')
box = box[[ '< 1 year','1 year', '2 years', '3 years', '4 years', '5 years',
'6 years', '7 years', '8 years', '9 years','10+ years' ]]
box.iplot(kind = 'box',colorscale='paired', boxpoints ='suspectedoutliers' , yaxis_title = 'Annual Income', yaxis_range = [0,150000], showlegend = False)
As expected, income generally rises with longer employment and the experience that comes with it. We can clearly see an increasing trend in median annual income with respect to employment length: people who have worked less than a year earn a median of 50k, and the figure increases to 70k for those employed more than 10 years.
box = test_loans[['loan_amnt','emp_length']].dropna(axis=0, how='any')
box = box.pivot(columns='emp_length', values='loan_amnt')
box = box[[ '< 1 year','1 year', '2 years', '3 years', '4 years', '5 years',
'6 years', '7 years', '8 years', '9 years','10+ years' ]]
box.iplot(kind = 'box',colorscale='paired', boxpoints ='suspectedoutliers' , yaxis_title = 'Loan Amount', showlegend = False)
People with longer employment length tend to borrow more money. Since we have seen that they also earn more, they can afford a higher loan. I wonder whether they have a different default rate.
bar = test_loans[['emp_length','loan_status']].pivot_table(columns='loan_status',index='emp_length', aggfunc=np.count_nonzero)
bar.columns=['good loan','bad loan']
bar = bar.reindex([ '< 1 year','1 year', '2 years', '3 years', '4 years', '5 years',
'6 years', '7 years', '8 years', '9 years','10+ years' ])
bar_copy = bar.copy()
bar_copy['good loan'] = 100*bar_copy['good loan']/bar.sum(axis=1)
bar_copy['bad loan'] = 100*bar_copy['bad loan']/bar.sum(axis=1)
bar_copy.iplot(kind='bar', yaxis_title = 'Percentage in Each Category')
So it turns out that the percentage of bad loans is rather similar across employment lengths. Apparently, how long you have been employed doesn't affect your ability to pay back a loan that much.
plot_pie_charts('grade', 'Grade')
LendingClub assigned grade B to the largest group of its borrowers, nearly 30% of the total, followed by 'A' and 'C' with 24% and 21%, respectively. The percentages of lower grades rise sharply among the bad loans, while the best grade, 'A', drops to only 9.5%. It is pretty safe to say grade can be a very strong indicator for distinguishing loans.
bar = test_loans[['grade','loan_status']].pivot_table(columns='loan_status',index='grade', aggfunc=np.count_nonzero)
bar.columns=['good loan','bad loan']
bar_copy = bar.copy()
bar_copy['good loan'] = 100*bar_copy['good loan']/bar.sum(axis=1)
bar_copy['bad loan'] = 100*bar_copy['bad loan']/bar.sum(axis=1)
bar_copy.iplot(kind='bar', yaxis_title = 'Percentage in Each Category')
The relation between loan status and grade is better demonstrated by the plot above. The percentage of bad loans increases as grade degrades: bad loans make up only 11% among borrowers with the highest grade, but the figure shoots up to 50.5% for the lowest grade. A higher interest rate means the borrower must pay more on top of the principal; especially in the early days of a loan, a large fraction of each payment goes toward interest. This puts much more pressure on the borrower compared to someone who borrowed the same amount at a lower rate.
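The point about early payments follows from amortization arithmetic: in month one, the interest portion of the payment is the principal times the monthly rate. A sketch with illustrative numbers (the installments below assume a 36-month term; they are not taken from the dataset):

```python
def first_month_interest_share(principal, annual_rate, installment):
    """Fraction of the first monthly payment that goes to interest
    rather than to paying down the principal."""
    return principal * (annual_rate / 12.0) / installment

# A $10,000 loan over 36 months: at a 20% APR roughly 45% of the first
# payment is interest, versus roughly 19% at a 7% APR.
print(first_month_interest_share(10000, 0.20, 371.64))
print(first_month_interest_share(10000, 0.07, 308.77))
```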
box = test_loans[['int_rate','grade']].pivot(columns='grade', values='int_rate')
box.iplot(kind = 'box',colorscale='paired',boxpoints ='suspectedoutliers' , yaxis_title = 'Interest rate', showlegend = False)
A strong correlation can be seen between grades and median interest rate. Grade 'G' has a median interest rate of 20.5%, while the median interest rate for grade A is only 7%. The highest interest rate of grade 'C' is even lower than the lowest interest rate of grade 'G'.
temp = test_loans[['annual_inc','loan_amnt','int_rate','installment', 'loan_status']]
temp['loan_status'] = temp['loan_status'].replace((0,1),('good loan','bad loan'))
g = sns.PairGrid(temp[temp['annual_inc']<300000], hue='loan_status')
g.map_diag(plt.hist, bins = 30, histtype="step", linewidth=3)
g.map_offdiag(plt.scatter, s=5)
g.add_legend()
g.fig.set_size_inches(16,16)
In the plots above, I try to explore whether there is any relation between 'annual income', 'loan amount', 'interest rate', and 'installment'. Also, we could see the distribution of these features with good and bad loans separated.
The first histogram (counting from the left) on the diagonal is the distribution of borrowers' annual income, with red representing the good loans and blue the bad loans. A similar Gaussian-like distribution with an extended tail can be seen for both, each peaking around 50k.
The second diagonal histogram shows spike-like features in the distribution of loan amount. I believe this is because people tend to borrow amounts that are multiples of a thousand or a hundred; this is shown more clearly in the plot below. Again, the distributions for both good and bad loans generally resemble a long-tailed normal curve, peaking around 5000.
The distribution of interest rate is a little different, especially for the good loans. Looking closer at the two apparent peaks at ~7% and ~11%, we can link them to grades 'A' and 'B', which happen to have median interest rates at those values; indeed, borrowers with grades 'A' and 'B' make up half of the population. A slightly lower peak around 13% can probably be associated with grade 'C'. The interest-rate distribution for bad loans has a rather broad top and peaks around 15%, where grades 'C', 'D', and 'E' cluster.
In all the plots above, bad and good loans generally overlap, with no clear boundary that could separate them well. But we do see a few interesting offsets between their distributions.
g = sns.FacetGrid(test_loans[['annual_inc','loan_amnt','loan_status']],size=12, hue='loan_status',palette='Set1')
g.map(plt.scatter, 'annual_inc', 'loan_amnt', s=15)
g.add_legend()
g.axes[0, 0].plot(np.arange(50000), 0.8*np.arange(50000), c = 'Green')
g.set(xlim=(0,310000))
g.set(ylim=(0,36000))
The green line in the plot above marks 80% of the borrower's annual income. I don't know whether this is a limit set by LendingClub, but apparently the loan amount rarely exceeds 80% of the borrower's income, with only a few exceptions. Again, bad loans are in blue and good ones in red. More points sit on lines like 5000, 10000, and 15000, consistent with our findings from the histogram. Bad loans cluster at the lower end of annual income, but this could simply be due to the much smaller sample size of bad loans.
box = test_loans[['int_rate','loan_status']].pivot(columns='loan_status', values='int_rate')
box.iplot(kind = 'box',colorscale='paired',boxpoints ='suspectedoutliers' ,
yaxis_title = 'Interest Rate', showlegend = False, title = 'Interest Rate by Loan Status <br> (Hover for breakdown)')
box = test_loans[['annual_inc','loan_status']].pivot(columns='loan_status', values='annual_inc')
box.iplot(kind = 'box',colorscale='paired',boxpoints ='suspectedoutliers' ,yaxis_range = [0,160000],
yaxis_title = 'Annual Income', showlegend = False, title = 'Annual Income by Loan Status <br> (Hover for breakdown)')
The two plots above show that good loans have a lower median interest rate and a slightly higher median annual income. A high income indicates a greater ability to pay back loans, and with the help of a lower interest rate, a borrower's hard-earned money mostly goes toward paying the principal instead of interest. This creates positive feedback and a healthier financial status.
g = sns.FacetGrid(test_loans[['installment','loan_amnt','term']],size=12, hue='term',palette='Set1')
g.map(plt.scatter, 'loan_amnt','installment', s=15)
g.add_legend()
g.axes[0, 0].plot(np.arange(35000), 0.0278*np.arange(35000), c = 'Green')
g.axes[0, 0].plot(np.arange(35000), 0.0167*np.arange(35000), c = 'Black')
I was interested in how much people pay in installments. Since there are only two terms in this dataset, 36 and 60 months, I plot the lines where a borrower would pay 1/36 (green) or 1/60 (black) of the total loan as the installment. In reality, the installment includes interest as well as principal, which is why most points sit above the respective lines. Points underneath likely represent people who made a large down payment to lower their monthly installment.
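For context, the installment on a fully amortized loan follows the standard annuity formula, installment = P * r / (1 - (1 + r)**-n), with r the monthly rate and n the number of payments; this is presumably why the points sit above the interest-free baselines. A sketch with illustrative numbers:

```python
def monthly_installment(principal, annual_rate, n_months):
    """Standard annuity payment: principal * r / (1 - (1 + r)**-n),
    where r is the monthly interest rate."""
    r = annual_rate / 12.0
    if r == 0:
        return principal / n_months
    return principal * r / (1.0 - (1.0 + r) ** -n_months)

# A $10,000 loan at 12% APR over 36 months: interest pushes the payment
# above the interest-free baseline of principal / term.
print(round(monthly_installment(10000, 0.12, 36), 2))  # 332.14
print(round(10000 / 36, 2))                            # 277.78, the 1/36 line
```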
#g = sns.lmplot(x='loan_amnt', y='int_rate',col ='loan_status',row = 'purpose', data = test_loans, size=5, hue='loan_status',palette='tab10', scatter_kws={'s': 5})
temp = test_loans[['annual_inc','loan_amnt','int_rate','emp_length', 'addr_state']]
temp['emp_length']=temp['emp_length'].replace(('< 1 year', '1 year', '2 years','3 years', '4 years', '5 years', '6 years', '7 years', '8 years', '9 years', '10+ years'),
(0, 1,2,3,4,5,6,7,8,9,10))
temp['int_rate'] = temp['int_rate']*100
temp = temp.groupby(['addr_state'],as_index = False).mean()
temp1 = test_loans[['loan_amnt', 'addr_state']].groupby(['addr_state'],as_index = False).sum()
temp2 = test_loans[['loan_amnt', 'addr_state']].groupby(['addr_state'],as_index = False).count()
temp1.columns=['addr_state','total_loan_amnt']
temp2.columns=['addr_state','total_num_loan']
temp = pd.concat([temp,temp1['total_loan_amnt'],temp2['total_num_loan'] ], axis=1)
temp = temp.round(decimals=2)
temp_copy=temp.copy()
for col in temp_copy.columns:
    temp_copy[col] = temp_copy[col].astype(str)
colorscale = [[0.0,'rgb(166,189,219)'],[0.2,'rgb(116,169,207)'],
[0.4,'rgb(54,144,192)'],[0.6,'rgb(5,112,176)'],
[0.8,'rgb(4,90,141)'],[1.0,'rgb(2,56,88)']]
temp_copy['text'] = temp_copy['addr_state'] + '<br>' +\
'Average Annual Income: '+temp_copy['annual_inc']+'<br>' +\
'Average Loan Amount: '+temp_copy['loan_amnt']+'<br>'+\
'Average Interest Rate(%): '+temp_copy['int_rate']+'<br>' +\
'Average Employment Length: ' + temp_copy['emp_length']+'<br>'+\
'Total Loan Amount: '+temp_copy['total_loan_amnt']+'<br>'+\
'Total Number of Loans: '+temp_copy['total_num_loan']
data = [ dict(
type='choropleth',
colorscale = colorscale,
autocolorscale = False,
locations = temp_copy['addr_state'],
z = temp_copy['total_loan_amnt'].astype(float),
locationmode = 'USA-states',
text = temp_copy['text'],
marker = dict(
line = dict (
color = 'rgb(255,255,255)',
width = 2
) ),
colorbar = dict(
title = 'Total Loans'+'<br>'+\
'Issued in USD')
) ]
layout = dict(
title = 'Loans Borrowed from Lending Clubs in the US <br>(Hover for breakdown. Drag & Zoom)',
geo = dict(
scope='usa',
projection=dict( type='albers usa' ),
showlakes = True,
lakecolor = 'rgb(255, 255, 255)'),
)
fig = dict( data=data, layout=layout )
iplot( fig, filename='choropleth-map1')
def sort_max(df, Features):
    temp_max = pd.DataFrame()
    for f in Features:
        a = pd.DataFrame({f: [df.sort_values(f, ascending=False)[f].values[0:3], df.sort_values(f, ascending=False)['addr_state'].values[0:3], np.around(df[f].mean(), decimals=2)]})
        temp_max = pd.concat([temp_max, a], axis=1)
    return temp_max
Features = ['annual_inc','loan_amnt','int_rate','emp_length', 'total_loan_amnt','total_num_loan']
max_by_state = sort_max(temp, Features)
max_by_state.columns = ['Annual Income', 'Loan Amount', 'Interest Rate (%)', 'Employment Length', 'Total Loan Amount', 'Total Number of Loans']
max_by_state.index=['Top 3', 'Top 3 States', 'National Average' ]
d = dict(selector="th", props=[('text-align', 'center')])
max_by_state.style.set_properties(**{'width':'10em', 'text-align':'center'}).set_table_styles([d])
California leads all states in the total amount borrowed from LendingClub by a very large margin, followed by New York and Texas. The large number of loans from these states plays a big role in their total loan amounts. All three states have above-average income, loan amount, interest rate, and number of loans; given that people there earn well above the average, it is no surprise they are able to borrow more. The average employment lengths of California and New York are above the national average, while Texas's is below it. Although Florida comes third in the total number of loans and fourth in the total amount borrowed, it has only around-average income, loan amount, and interest rate. Borrowers from Alaska, Rhode Island, and Arkansas have been employed for more than one year longer than the national average. Alaska is also high in average interest rate, loan amount, and annual income.
Hover over the state you are interested in to check out its numbers.
temp = pd.crosstab(test_loans['addr_state'], test_loans['loan_status']).reset_index()
temp.columns=['addr_state', 'good_loan_amnt', 'bad_loan_amnt']
bad_loan_percentage=pd.DataFrame({'bad_loan_percentage':100*temp['bad_loan_amnt']/(temp['good_loan_amnt']+temp['bad_loan_amnt'])})
temp = pd.concat([temp[['addr_state', 'bad_loan_amnt']],bad_loan_percentage],axis=1)
temp1 = test_loans[['dti', 'int_rate','addr_state']].groupby(['addr_state'],as_index = False).mean()
temp = pd.concat([temp,temp1[['dti', 'int_rate']]], axis=1)
temp['int_rate'] = temp['int_rate']*100
temp = temp.round(decimals=2)
temp_copy=temp.copy()
for col in temp_copy.columns:
    temp_copy[col] = temp_copy[col].astype(str)
colorscale = [[0.0,'rgb(4,90,141)'],[0.125,'rgb(5,112,176)'],[0.250,'rgb(54,144,192)'],
[0.375,'rgb(116,169,207)'],[0.5,'rgb(208,209,230)'],
[0.625,'rgb(247,247,247)'],[0.750,'rgb(253,219,199)'],
[0.875,'rgb(244,165,130)'],[1.0,'rgb(214,96,77)']]
temp_copy['text'] = temp_copy['addr_state'] + '<br>' +\
'Number of Bad Loans: '+temp_copy['bad_loan_amnt']+'<br>' +\
'Percentage of Bad Loans: '+temp_copy['bad_loan_percentage']+'<br>'+\
'Average Debt-to-Income Ratio: '+temp_copy['dti']+'<br>'+\
'Average Interest Rate: '+temp_copy['int_rate']
data = [ dict(
type='choropleth',
colorscale = colorscale,
autocolorscale = False,
locations = temp_copy['addr_state'],
z = temp_copy['bad_loan_percentage'].astype(float),
locationmode = 'USA-states',
text = temp_copy['text'],
marker = dict(
line = dict (
color = 'rgb(255,255,255)',
width = 2
) ),
colorbar = dict(
title = 'Percentage of' +'<br>'+\
'Bad Loans')
) ]
layout = dict(
title = 'Default Risk of Lending Clubs in the US <br>(Hover for breakdown. Drag & Zoom)',
geo = dict(
scope='usa',
projection=dict(type='albers usa' ),
showlakes = True,
lakecolor = 'rgb(255, 255, 255)'),
)
fig = dict( data=data, layout=layout )
iplot( fig, filename='cloropleth-map2' )
Features = ['bad_loan_percentage','dti', 'int_rate']
max_by_state = sort_max(temp, Features)
max_by_state.columns = ['Bad Loan Percentage', 'Debt-to-Income Ratio', 'Average Interest Rate (%)']
max_by_state.index=['Top 3', 'Top 3 States', 'National Average' ]
d = dict(selector="th", props=[('text-align', 'center')])
max_by_state.style.set_properties(**{'width':'10em', 'text-align':'center'}).set_table_styles([d])
Loans borrowed in Nebraska, Indiana, and Nevada have the highest chance of default. The small total number of loans from Nebraska and Indiana might inflate their bad loan percentages; Nevada, on the other hand, has a large enough number of loans to make the statistic meaningful. Despite having the highest debt-to-income ratios, Idaho, Arkansas, and West Virginia all have below-average bad loan percentages. Texas and New York, two of the states with the largest loan amounts, have very healthy loan statuses: their bad loan percentages are below the national average. California, however, has a slightly higher default rate of 16.58%, just above the national average. Considering that California borrowed more than the second and third states combined, its bad loan percentage is actually very impressive. The three states with the highest interest rates are not doing well on loan status either: Alaska, Nebraska, and Hawaii are among the states most likely to default on a loan. A higher interest rate does come with a higher risk.
I did a bit of data cleaning and preparation at the beginning of section 2. However, before heading into the modeling, we still need to handle missing data, feature selection, feature scaling, and the conversion of categorical features.
fig = plt.figure(figsize = (16,9))
sns.heatmap(test_loans.isnull(),
yticklabels = False,
cbar = False,cmap='plasma')
Some data are missing in 'Employment length' and 'number of bankruptcies in public record'. Since we already have plenty of data, we could simply drop the rows with missing entries. However, I will try to fill in the missing values in the 'Employment length' column.
test_loans['emp_length'].unique()
First, let's see whether there is a relation between one's annual income and one's employment length.
fig = plt.figure(figsize = (16,9))
sns.boxplot(y = 'emp_length',
x = 'annual_inc',
data = test_loans,
palette = 'Blues',
order=['< 1 year', '1 year', '2 years','3 years', '4 years', '5 years', '6 years', '7 years', '8 years', '9 years', '10+ years'])
plt.xlim(0,200000)
It makes sense that people with a higher annual income tend to have been working longer. Ideally, I could fit a simple linear or curved trend to this relationship and use it to fill in the missing employment length data.
However, let's see whether there are other ways to quickly fill in the missing data.
num_emp_len = pd.DataFrame({'num_emp_len':test_loans['emp_length'].replace(('< 1 year', '1 year', '2 years','3 years', '4 years', '5 years', '6 years', '7 years', '8 years', '9 years', '10+ years'),
(0, 1,2,3,4,5,6,7,8,9,10))})
test_loans=pd.concat([test_loans, num_emp_len], axis=1)
fig = plt.figure(figsize = (16,9))
sns.boxplot(x = 'grade',
y = 'num_emp_len',
data = test_loans,
palette = 'Blues',
order=['A', 'B', 'C' ,'D' ,'E' ,'F' ,'G'])
The LendingClub-assigned grade does not distinguish borrowers' employment lengths very well.
fig = plt.figure(figsize = (16,9))
sns.boxplot(x = 'home_ownership',
y = 'num_emp_len',
data = test_loans,
palette = 'Blues',
)
It seems that different types of home ownership tend to come with different employment lengths (by median value). This makes sense as well: people with lower incomes tend to pay rent rather than buy, or do not yet have the means to own a house. So let's fill in the missing data with the median employment length for each home-ownership type, read off the chart above.
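Rather than eyeballing the medians from the boxplot, they can also be computed directly with a groupby. A quick sketch on a toy frame standing in for `test_loans` (the values below are made up for illustration, chosen so the medians match the ones used in the imputation function):

```python
import pandas as pd

# Toy stand-in for test_loans; in the notebook this would be
# test_loans.groupby('home_ownership')['num_emp_len'].median()
df = pd.DataFrame({
    'home_ownership': ['RENT', 'RENT', 'OWN', 'OWN', 'MORTGAGE', 'MORTGAGE'],
    'num_emp_len':    [2,      4,      5,     5,     4,          8],
})
medians = df.groupby('home_ownership')['num_emp_len'].median()
print(medians)
```

Filling from these group medians is exactly what the imputation function below hard-codes.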
def impute_emp_len(cols):
    # Use label-based access; positional indexing on a Series is deprecated
    num_emp_length = cols['num_emp_len']
    home_ownership = cols['home_ownership']
    if pd.isnull(num_emp_length):
        # Median employment length per home-ownership type (from the chart above)
        if home_ownership == 'RENT':
            return 3
        elif home_ownership == 'OWN':
            return 5
        elif home_ownership == 'MORTGAGE':
            return 6
        elif home_ownership == 'OTHER':
            return 2.5
        else:
            return 0
    else:
        return num_emp_length
final_loans = test_loans.copy()
final_loans['num_emp_len'] = final_loans[['num_emp_len','home_ownership']].apply(impute_emp_len,axis=1)
Let's drop the original 'emp_length' column and the rows that have NaN in 'pub_rec_bankruptcies'. We should see no missing data after this.
final_loans.drop('emp_length',axis=1,inplace=True)
final_loans.dropna(axis=0, how='any',inplace=True)
'loan_amnt', 'installment', 'annual_inc', and 'revol_bal' all have much larger values than the rest of the columns. Let's take their logarithms for the benefit of future ML models, e.g. logistic regression, SVM, PCA, kNN, etc. Ideally, I should do proper feature scaling for these distance-based models, but this should be good enough.
def log(df):
    # Leave zeros unchanged to avoid -inf from np.log(0)
    if df != 0:
        return np.log(df)
    else:
        return df
final_loans[['log_loan_amnt','log_installment','log_annual_inc', 'log_revol_bal']] = final_loans[['loan_amnt','installment','annual_inc', 'revol_bal']].applymap(log)
final_loans.drop(['loan_amnt','installment','annual_inc', 'revol_bal'],axis=1,inplace=True)
final_loans.info()
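As noted above, the distance-based models would ideally get proper feature scaling rather than just a log transform. A minimal sketch with scikit-learn's `StandardScaler`, using toy arrays standing in for the train/test feature matrices; the key point is fitting the scaler on the training split only, so no test-set statistics leak in:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the train/test feature matrices
X_tr = np.array([[10.5, 0.1], [9.8, 0.3], [11.1, 0.2]])
X_te = np.array([[10.0, 0.25]])

scaler = StandardScaler().fit(X_tr)    # learn mean/std from train only
X_tr_scaled = scaler.transform(X_tr)   # zero mean, unit variance per column
X_te_scaled = scaler.transform(X_te)   # same transform applied to test
```

The same pattern would slot in after the train/test split below, applied to the numeric columns.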
Note that 'term, grade, home_ownership, purpose, addr_state' have 'object' as their data types. They are the so called categorical features. We need to transform them into data types that sklearn can understand.
There are many methods for transforming categorical features into numerical variables. A good article discusses several common methods and compares them (http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/). The author has also developed a set of scikit-learn-style transformers for encoding categorical variables with different techniques (http://contrib.scikit-learn.org/categorical-encoding/). I am going to use it here.
I am usually against ordinal coding, in which we simply assign an integer to each category. It is appealing because it adds no extra dimensions to the problem; however, it implies an order among the values that may not actually exist. I did use it for the 'employment length' feature, since those values do have a natural order.
Another method I usually end up using is one-hot coding, which can be achieved with the 'get_dummies' function in Pandas. In this problem, 'term', 'grade', 'addr_state', 'home_ownership', and 'purpose' are still categorical features. For 'addr_state' in particular, one-hot coding would add an extra 48 features (since this column covers most US states), and the 'purpose' column would add another 13. Higher dimensionality usually means fitting parameters in a more complex space, which can slow down algorithms or even make them intractable.
I will try binary encoding in this problem. It adds fewer dimensions than one-hot coding and may provide better results in general.
import category_encoders as ce
encoder = ce.BinaryEncoder(cols=['term', 'grade', 'addr_state','home_ownership', 'purpose'])
encoder.fit(final_loans.drop('loan_status',axis=1),final_loans['loan_status'])
ult_loans = pd.concat([encoder.transform(final_loans.drop('loan_status', axis=1)), final_loans['loan_status']],axis=1)
ult_loans.head()
ult_loans.info()
If you want to use the 'one-hot coding', run the code below. It will give us 90 total features with 41112 samples, compared to 30 features from the 'binary coding'.
#ult_loans = pd.get_dummies(final_loans, columns = ['term', 'grade', 'addr_state','home_ownership', 'purpose'])
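A quick back-of-the-envelope check on why binary encoding is so much more compact. One-hot adds roughly one indicator column per category level, while binary encoding needs only about log2 as many bit columns (the exact counts depend on the encoder implementation and version, so treat these as rough):

```python
import math

# Approximate category counts for the encoded columns in this dataset
categories = {'addr_state': 49, 'purpose': 14, 'grade': 7,
              'home_ownership': 4, 'term': 2}

for name, n in categories.items():
    one_hot = n                            # ~one indicator column per level
    binary = math.ceil(math.log2(n + 1))   # ~log2(n) columns of bits
    print(f'{name}: one-hot ~{one_hot} cols, binary ~{binary} cols')
```

This matches the 90-vs-30 total feature counts mentioned above: the savings come almost entirely from 'addr_state' and 'purpose'.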
Train Test Split
ult_loans.to_csv('ult_loans.csv')
# Loan Ratios (Imbalanced classes)
ult_loans['loan_status'].value_counts()/len(ult_loans['loan_status']) * 100
We can see that good loans (0) make up the majority of the dataset. We want to split the train and test sets so that both preserve this class ratio.
stratified = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
for train_set, test_set in stratified.split(ult_loans, ult_loans['loan_status']):
    Train = ult_loans.iloc[train_set]
    Test = ult_loans.iloc[test_set]
print('Train set loan ratio')
print(Train['loan_status'].value_counts()/len(Train['loan_status']) * 100)
print('Test set loan ratio')
print(Test['loan_status'].value_counts()/len(Test['loan_status']) * 100)
X_train = Train.drop('loan_status',axis=1)
y_train = Train['loan_status']
X_test = Test.drop('loan_status',axis=1)
y_test = Test['loan_status']
Training the Random Forest model
RForestC=RandomForestClassifier(random_state=None)
RForestC.fit(X_train,y_train)
Predictions and Evaluation
predictions=RForestC.predict(X_test)
print(classification_report(y_test,predictions))
For this part, I will use 10-fold cross-validation to evaluate each model's performance.
Stratified_K_Fold = StratifiedKFold(n_splits=10)
random_state = 101
# List of classifiers
MLM = [
#Ensemble
AdaBoostClassifier(random_state=random_state),
GradientBoostingClassifier(random_state=random_state),
ExtraTreesClassifier(random_state=random_state),
RandomForestClassifier(random_state=random_state),
#LR
LogisticRegressionCV(random_state=random_state),
#Naive Bayes
BernoulliNB(),
GaussianNB(),
#Nearest Neighbor
KNeighborsClassifier(),
#SVM
SVC(cache_size = 2000, probability=True, random_state=random_state),
#NuSVC(cache_size = 2000,probability=True,random_state=random_state),
LinearSVC(random_state=random_state),
#Trees
DecisionTreeClassifier(random_state=random_state),
#Discriminant Analysis
LinearDiscriminantAnalysis(),
QuadraticDiscriminantAnalysis(),
#Multilayer perceptron (neural network)
MLPClassifier(random_state=random_state)
]
cv_accuracy = []
for estimator in MLM:
    print('running ' + str(estimator).split('(')[0])
    cv_accuracy.append(cross_val_score(estimator, X_train, y = y_train, scoring = 'accuracy', cv = Stratified_K_Fold, n_jobs = -1 ))
cv_mean = []
cv_std = []
for accuracy in cv_accuracy:
    cv_mean.append(np.mean(accuracy))
    cv_std.append(np.std(accuracy))
cv_result=pd.DataFrame({'classifier':['AdaBoost','GradientBoosting','ExtraTrees','RandomForest','LogisticRegressionCV','BernoulliNB','GaussianNB','KNeighbors', 'SVC','LinearSVC','DecisionTree','LinearDiscriminantAnalysis','QuadraticDiscriminantAnalysis','MLP'], 'Cross Validation Accuracy Mean':cv_mean,'Cross Validation Accuracy Std':cv_std})
cv_result=cv_result.sort_values('Cross Validation Accuracy Mean', ascending=False)
cv_result
fig = plt.figure(figsize = (16,9))
fig = sns.barplot(x = 'Cross Validation Accuracy Mean', y = 'classifier', data = cv_result, palette='Set3', orient = 'h',**{'xerr':cv_result['Cross Validation Accuracy Std']})
fig.set_xlabel('Cross Validation Mean Accuracy')
It seems that a bunch of models achieve accuracy around 85%, while the others underperform. But here is a bigger question: does 85% accuracy mean a good model for this problem?
Let's see what the baseline accuracy is. Recall that when we split the train and test sets, we calculated the percentage of good and bad loans in our dataset: 85.21% of the loans are good and the rest are bad. A very naive model that predicts every loan to be good would already achieve 85.21% accuracy. Of all the models tried above, only SVC, GradientBoosting, and LogisticRegressionCV came close to that number. Granted, I was only using default parameters, but this could still mean that my model choices are bad, my choice of features is poor, or there simply isn't enough useful information to make convincing predictions.
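The "always predict good" baseline can be made explicit with scikit-learn's `DummyClassifier`. A minimal sketch, using a synthetic label vector that mimics the roughly 85/15 split (in the notebook this would be fit on X_train/y_train instead):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels mimicking the ~85/15 good/bad split; 1 = bad loan
rng = np.random.RandomState(0)
y = (rng.rand(1000) < 0.15).astype(int)
X = np.zeros((1000, 1))   # features are ignored by the dummy model

# 'most_frequent' always predicts the majority class (good loans here)
baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
print('baseline accuracy:', baseline.score(X, y))
```

Any real model should be judged against this number, not against zero.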
I explained at the beginning of this notebook that I decided to exclude features like 'total received principal', which might introduce a data leakage problem into the model; otherwise, I would get perfect predictions every time. For the sake of learning and practicing, I will tune the model parameters and evaluate their performance anyway. Later, I might try more feature engineering to see whether there is any chance to improve the result further.
The four models I chose for parameter tuning are AdaBoost, Random Forest, Gradient Boosting, and the Support Vector Classifier.
# AdaBoost classifier parameter tuning
AdaBoost = AdaBoostClassifier()
ada_param_grid = {'algorithm' : ['SAMME.R'],
'n_estimators' :[10,50,100],
'learning_rate': [0.3, 0.75, 1, 1.5],
}
gsAdaBoost = GridSearchCV(AdaBoost, param_grid = ada_param_grid, cv=Stratified_K_Fold, scoring='accuracy', n_jobs = -1, verbose = 1)
gsAdaBoost.fit(X_train,y_train)
ada_best = gsAdaBoost.best_estimator_
gsAdaBoost.best_score_
# Random Forest classifier parameter tuning
RForestC = RandomForestClassifier()
RFC_param_grid = {'max_depth': [5,10],
'max_features': ['log2',None],
'min_samples_split': [2, 10],
'min_samples_leaf': [10,50],
'bootstrap': [False],
'n_estimators' :[200,500],
'criterion': ['gini'],
'n_jobs': [2],
}
gsRForestC = GridSearchCV(RForestC, param_grid = RFC_param_grid, cv=Stratified_K_Fold, scoring='accuracy', n_jobs = -1, verbose = 1)
gsRForestC.fit(X_train,y_train)
RForestC_best = gsRForestC.best_estimator_
# Best score
gsRForestC.best_score_
# Gradient Boosting classifier parameter tuning
GBC = GradientBoostingClassifier()
GBC_param_grid = {'loss' : ["deviance"],
'n_estimators' : [100,300],
'learning_rate': [0.01, 0.1],
'max_depth': [5, 10],
'min_samples_leaf': [2,10],
'max_features': ['log2',None],
}
gsGBC = GridSearchCV(GBC, param_grid = GBC_param_grid, cv = Stratified_K_Fold, scoring='accuracy', n_jobs = -1, verbose = 1)
gsGBC.fit(X_train,y_train)
GBC_best = gsGBC.best_estimator_
# Best score
gsGBC.best_score_
# Support Vector Classifier parameter tuning
svc_param_grid = {'kernel': ['rbf'],
'gamma': [ .25, .5, .75, 1.0],
'C': [0.1, 1, 10],
'class_weight':['balanced']}
gsSVC = GridSearchCV(SVC(probability = True, cache_size = 4000), param_grid = svc_param_grid, cv = Stratified_K_Fold, scoring='accuracy', n_jobs = -1, verbose = 1)
gsSVC.fit(X_train,y_train)
SVC_best = gsSVC.best_estimator_
# Best score
gsSVC.best_score_
#Reference for this plot - http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html#
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure(figsize=(10, 6))
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel('Training examples')
    plt.ylabel('Score')
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color='coral')
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color='skyblue')
    plt.plot(train_sizes, train_scores_mean, 'o-', color='coral',
             label='Training score')
    plt.plot(train_sizes, test_scores_mean, 'o-', color='skyblue',
             label='Cross-validation score')
    plt.legend(loc='best')
    return plt
g = plot_learning_curve(ada_best,'AdaBoost learning curves',X_train,y_train,cv=Stratified_K_Fold)
g = plot_learning_curve(RForestC_best,'RandomForest learning curves',X_train,y_train,cv=Stratified_K_Fold)
g = plot_learning_curve(GBC_best,'GradientBoosting learning curves',X_train,y_train,cv=Stratified_K_Fold)
g = plot_learning_curve(SVC_best,'SVC learning curves',X_train,y_train,cv=Stratified_K_Fold)
The learning curve is a good tool for checking whether a machine learning model is overfitting or underfitting. Surprisingly, both AdaBoost and Random Forest start out underfitting the training set, but the good news is that their training curves gradually converge with the cross-validation scores as the number of training examples increases. Gradient Boosting, on the other hand, overfits the training set at first and then converges with the cross-validation score. So the parameter tuning for AdaBoost, Random Forest, and Gradient Boosting was somewhat successful (not that the prediction scores are good, but this is probably the best these models can do). However, SVC still overfits the training set (achieving a perfect score throughout) and never converges with the cross-validation score, so further parameter tuning will be needed in the future.
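One possible direction for that further SVC tuning: push gamma well below the earlier [0.25, 1.0] range and include smaller C values, since both choices regularize the RBF kernel and should reduce overfitting. A sketch on synthetic data standing in for X_train/y_train; the ranges below are my guess, not a tuned recipe:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in for the loan features/labels
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Smaller gamma -> smoother decision boundary; smaller C -> wider margin
svc_param_grid = {'kernel': ['rbf'],
                  'gamma': [1e-3, 1e-2, 0.1],
                  'C': [0.01, 0.1, 1],
                  'class_weight': ['balanced']}
gs = GridSearchCV(SVC(), param_grid=svc_param_grid,
                  cv=StratifiedKFold(n_splits=3), scoring='accuracy', n_jobs=-1)
gs.fit(X, y)
print(gs.best_params_, round(gs.best_score_, 3))
```

Whether the SVC learning curve actually converges after this would need to be checked with plot_learning_curve again.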
I want to see which features contribute the most to distinguishing good from bad loans. One advantage of tree-based methods (here AdaBoost, Random Forest, and Gradient Boosting) is that they are explainable: we can easily visualize how the trees grow and make decisions, or the importance of each feature when the trees are constructed. Below, I plot the feature importance for the three tuned tree-based methods.
best_estimators = [('AdaBoost', ada_best), ('RandomForest', RForestC_best), ('GradientBoosting',GBC_best)]
fig,axarr = plt.subplots(3, 1,figsize=(16, 27))
fig.subplots_adjust(right=0.95,top=0.97, bottom=0.04, hspace=0.15, wspace=0.07,left=0.05)
def NumberIn(String):
    # True only when the feature name's last chunk is a digit,
    # i.e. the column came from binary encoding (e.g. 'purpose_0')
    return String.split('_')[-1].isdigit()
for row, estimator in enumerate(best_estimators):
    a = []
    # Recall in section 3.3, I used binary encoding to deal with the categorical features.
    # When plotting the feature importance graph, instead of having features like 'purpose_0, purpose_1, etc',
    # I sum them up as one combined feature using their original feature name.
    for f in X_train.columns:
        if NumberIn(f) is True:
            a.append(f.rsplit('_',1)[0])
        else:
            a.append(f)
    importances = pd.DataFrame({'features':a, 'importance': estimator[1].feature_importances_})
    importances = importances.groupby(['features'],as_index = False).sum().sort_values('importance',ascending = False)
    g = sns.barplot(x = importances['importance'], y = importances['features'], orient='h', palette= 'plasma', ax = axarr[row])
    g.set_xlabel('Feature Importance',fontsize = 12)
    g.set_ylabel('Features',fontsize = 12)
    g.tick_params(labelsize = 12)
    g.set_title(estimator[0] + ' Feature Importance')
# Plot of the feature importance without summing up binary encoded features
"""
best_estimators = [('AdaBoost', ada_best), ('RandomForest', RForestC_best), ('GradientBoosting',GBC_best)]
fig,axarr = plt.subplots(3, 1,figsize=(16, 27))
fig.subplots_adjust(right=0.95,top=0.97, bottom=0.04, hspace=0.15, wspace=0.07,left=0.05)
for row, estimator in enumerate(best_estimators):
importances = pd.DataFrame({'features':X_train.columns, 'importance': estimator[1].feature_importances_}).sort_values('importance',ascending = False)
g = sns.barplot(x = importances['importance'], y = importances['features'], orient='h', palette= 'plasma', ax = axarr[row])
g.set_xlabel('Feature Importance',fontsize = 12)
g.set_ylabel('Features',fontsize = 12)
g.tick_params(labelsize = 12)
g.set_title(estimator[0] + ' Feature Importance')
"""
Random Forest and Gradient Boosting both consider interest rate, grade, annual income, and term length as their top four indicators of loan status. This is consistent with my earlier findings that higher interest rates and lower grades are more likely to result in bad loans, and that the percentage of bad loans is higher for the longer loan term. I was not so sure that the slight difference in median annual income between good and bad loans was that indicative, but Gradient Boosting considers it important. Personally, I agree more with the Random Forest ranking. Loan amount is actually less useful in determining loan status than one might think, and the same goes for the borrower's state. Features like home ownership, employment length, and number of accounts are far less important according to the models, which is also not too surprising given our earlier discussion.
AdaBoost ranks feature importance a little differently from the other two. Although it also considers interest rate and annual income among the most important features, it ranks 'purpose' and 'address state' much higher than the other two methods do, and it does not even take 'grade' or 'loan amount' into consideration. Despite the different ranking, AdaBoost achieves a cross-validation score similar to the others, and in fact it produces the best prediction on the test set by a small margin.
best_estimators = [('AdaBoost', ada_best), ('RandomForest', RForestC_best), ('GradientBoosting',GBC_best),('SVC', SVC_best)]
for estimator in best_estimators:
    predictions = estimator[1].score(X_test,y_test)
    print(estimator[0] + ' mean accuracy')
    print(predictions)
AdaBoost leads on the test dataset. However, as discussed at the beginning of section 4.3, none of the trained models beat the baseline accuracy by much. After excluding features like 'total received principal' that could cause a data leakage problem, there is not much useful information left to predict loan status with high accuracy. I might try other methods like XGBoost, and further tune the SVC, in the future for the sake of learning and practicing. I will update this notebook as soon as I have new results.
!jupyter nbconvert Lending-club-loan-data-analysis.ipynb --to html